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Abstract 

In this work we study the set size distribution estimation problem, where elements are randomly sampled 
from a collection of non-overlapping sets and we seek to recover the original set size distribution from the 
samples. This problem has applications to capacity planning, network theory, among other areas. Examples of 
real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow 
size distributions on the Internet. We demonstrate that it is hard to estimate the original set size distribution. The 
recoverability of original set size distributions presents a sharp threshold with respect to the fraction of elements 
that remain in the sets. If this fraction remains below a threshold, typically half of the elements in power-law and 
heavier-than-exponential-tailed distributions, then the original set size distribution is unrecoverable. We also discuss 
practical implications of our findings. 

Index Terms — Cramer-Rao lower bound, Fisher information, set size distribution estimation. 

I. Introduction 

Networks are increasingly large and complex, posing tremendous challenges to their characterization in 
the wild. Characterizing network structure (e.g. degree distribution), network traffic flows (e.g. TCP/IP flow 
sizes in communication networks), node labels (e.g. group memberships), is usually impossible without 
resorting to sampling due to the size and scale of current networks. Practitioners often sample networks to 
estimate their characteristics. Many problems in network characterization through sampling can be mapped 
into the class of set size distribution estimation problems. The set size distribution estimation problem 
is stated as follows. Consider a collection of non-overlapping sets whose elements are probabilistically 
sampled. The problem is to estimate the original (pre- sampling) set size distribution based on the samples. 

Set size distribution estimation has several applications. One example of particular interest is the 
estimation of in-degree distributions of on-line social networks, where nodes represent people and a 
directed edge represents, for instance, one or more messages exchanged between two pairs of nodes. 
By monitoring message exchanges one samples a fraction of the edges. Using these samples we want to 
estimate the in-degree or out-degree distribution of nodes. The set size distribution problem also manifests 
itself in other areas, including Internet traffic monitoring, e.g., estimating the size distribution (in packets) 
of TCP/UDP flows |2), and in next generation Internet capacity planing, such as estimating the number 
of copies of a movie in a CDN of next-generation routers. Fortunately, simple maximum likelihood [|2j 
or Bayesian- style estimators exist, even when we are unable to observe sets without observed elements. 

Despite the importance of characterizing set size distributions, to the best of our knowledge no deep 
analysis of set size distribution estimation exists in the literature. We fill this gap and show that set size 
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distribution estimation exhibits intriguing abnormal statistical properties. To best illustrate our results, 
consider the estimation of in-degree distributions of arbitrarily large power-law graphs. We prove that if 
less than 50% of the edges are observed then the output of any estimator (be it frequentist or Bayesian) 
will be as truthful to the original in-degree distribution as a set of random numbers between zero and one. 
Moreover, when nodes without sampled incoming edges are unobservable, even a first order metric like 
average degree is subject to the same threshold behavior, i.e., sampling less than 50% of all incoming 
edges impedes the estimation of in-degree averages. The latter result seemly defies intuition. We prove 
these and other results in the general setting of sets with arbitrary set size distributions. In what follows 
we give an overview of our contributions. 



A. General Observations 

In this work we uncover intriguing set size distribution estimation properties, including: 

• A (finite) increase in samples may result in no reduction in estimation errors. 

Unlike estimation problems such as election polls, where a sufficient increase in samples always results in 
increased accuracy, we show, paradoxically, that in the set size distribution estimation problem an increase 
in samples may, in practice, result in no increase in accuracy. Section IV unveils the root cause of this 
odd behavior and explains when it can be avoided. Another interesting property is: 

• In networks with large set sizes (e.g., nodes with large degrees) and power-law set size distributions 
(in fact our results hold for any heavier-than-exponential distributions), randomly sampling less 
than 50% of set elements (e.g., edges of a node) provides almost no information about the set size 
distribution or the average set size. However, in networks with sub-exponential set size distributions, 
accurate set size distributions estimation is always possible. 

The above observation is interesting because power-laws have more tail probability mass and, thus, large 
sets are more likely to have sampled elements than in sub-exponential tails. However, and despite this, 
we show that if less than 50% of elements are sampled, then estimates of power-laws distributions (more 
precisely, any heavier-than-exponential distribution) are significantly less accurate than the estimates ob- 
tained from sub-exponential distributions. Our work also provides a host of equally puzzling observations, 



fully and formally presented in Section IV 



B. Outline 

Our paper is organized as follows. In Section [n] we conduct experiments on the indegree distribution 
estimation with real data. Section III presents the sampling and estimation models. Section IV presents 
our theoretic results. Section VI presents our discussion section where we analyze problems that field 
analysts are likely to face in practice, highlighting common mistakes made in the literature and how to 
avoid them. Finally Section |VII| presents the conclusions and related work. 



II. Estimation with Real Data 

In this section, we experiment with one particular application of the set size distribution problem: the 
estimation of the in-degree distribution of a network. Consider the Enron dataset, that describes a network 
composed by a group of people who exchanged emails during a certain period of time. Here each node 
represents a person and two people have a directed edge if one has emailed the other. The maximum 
in-degree in this network is 1383. 

Collecting a fraction of the exchanged messages means sampling network edges. Disregarding edge 
weights, assume the directed edges are independently sampled with probability p. Henceforth, each person 
with more than one observed incoming email shall be called a sample. Figure [la] depicts the quality of the 



in-degree estimator in ([4]) (see Section IV for the derivation) with p = 0.25, leading to N = 10 4 sampled 
individuals. The black dots indicate the true in-degree distribution, the blue curve shows a typical estimate, 
and the heat map indicates the density of estimated values across 100 runs, where red indicates high density 
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and yellow (white) indicates low (no) density of estimated values. We observe from the blue curve that 
the estimated values can be orders of magnitude away from the actual values and from the heat map we 
observe that the blue line is typical. 

In what follows we illustrate the effects of varying the number of samples N or changing the sample 
probability p separately. To vary N while keeping p fixed, we draw a node in-degree directly from the 
in-degree distribution of this network and subsequently sample its edges. We repeat this process until we 
obtain iV observed sets. This can be seen as sampling a larger (smaller) network that has the same degree 
distribution. 

We make two main observations: 

1) Increasing the number of samples yields no reduction in estimation errors. This is an odd 
behavior. We know from estimation theory that the error should decrease by \[M when the number 



of samples is increased by a factor of M. Figure lb shows the corresponding results for N = 50 x 10 



We observe that the estimated fraction of nodes of each degree can still be very far from the actual 
values. 

To make it clear that the accuracy gain from increasing the number of samples is not in agreemeent 
with theory, we compute the estimate error obtained when we vary the number of samples N G 
{5, 10, 20, 50, 100} x 10 3 , for p = 0.25. The error is first measured in terms of the Normalized Root 
Mean Square Error (NRMSE), which is defined as 



NRMSE (" 



0i 

where 9i and 9i are the estimated and true fraction of degree i nodes, respectively. Then we take 
the average NRMSE from the head (degrees up to 10) and the tail (degrees larger than 10) of the 
distribution separately. 



Surprisingly, we observe in Figure lc that there is almost no improvement in accuracy across 
different sample sizes, even when we compare 5 x 10 3 and 10 5 samples. We also display in this 
figure the expected reduction in the NRMSE for both head and tail by dashed lines. It turns out that 
the error does not decrease as we would expect. This raises the question of why, which we address 
in Section I 



2) For much larger values of p, the error starts to decrease with the number of samples. According 



to Theorem 4.1 that we describe in Section IV the difficulties experienced above arise due to the 



use of small sampling probability (p < 0.5) with heavy-tailed distributions, ant not due to a lack 



of samples. Hence we repeat the experiment using p = 0.9. Figures [Id] and [Te] show the heat maps 
for N = 20 x 10 3 and N = 10 5 . As opposed to what we previously saw, increasing the number 
of samples makes the estimates closer to the true in-degree distribution. The accuracy gain as a 



function of the number of samples is shown in Figure If In fact, we observe that the NRMSE 
does decrease as expected for the head of the distribution, but not for the tail. Why are there two 
distinct behaviors, one for the head and one for the tail? Why did it help to increase the number of 
samples when estimating frequencies of small degrees for p = 0.9, as opposed to what we observed 
for p = 0.25? Is it possible to make the NRMSE of the tail to decrease as fast as the NRMSE of 
the head? 

In order to investigate the questions we pose here, we study the Cramer-Rao Lower Bound (CRLB) of 
the set size estimation problem. This give us a lower bound on the estimation errors based on the amount 
of information contained in the samples, measured in terms of Fisher Information. Moreover, we apply 
the CRLB to the estimation of the in-degree distribution and average in-degree. 

III. Model 

Let Sic be a nonempty set of elements, k — 1, . . . , m, with S^nSj — 0, i,j — 1, . . .m, i ^ j. Let S& = 
denote the size of the A;-th set and assume set sizes are i.i.d. with distribution S fe ~ 6 = (9i, . . . , 8w), 
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Figure 1. The first row (a-c) shows the results for p — 0.25, while the second row (d-e) shows the corresponding plots for p = 0.90. (a- 
b,d-e) True degree distribution, one example of estimate and heat map indicating the ocurrence rates of the estimate values for TV — 10 x 10 3 
samples (first column) and TV = 50 x 10 3 samples (second column), respectively. The red color in the heat map indicates high density of 
estimated values and yellow (white) indicates low (no) density of estimated values. A subplot shows a zoom-in for the first degrees. (c,f) 
Average NRMSE of the head and the tail of the distribution for TV G {1, 5, 10, 20, 100} x 10 3 . Dashed line shows how the error should vary 
with the number of samples. In (c) we have the typical behavior of wrong estimates. Increasing the number of samples does not improve 
the quality of estimates. On the other hand (f) shows the typical behavior of correct estimates. Here increasing the number of samples 
yields lower estimation errors of the head. 



W > 1 k > 1. We assume W finite (W < oo). The model breaks nodes (edges) into groups (sets) and our 
task in what follows is to characterize those groups from incomplete observation (sample) of these sets. 
To illustrate the model, consider a directed graph; the set of incoming (outgoing) edges of a node k is 
represented by Sk, is the indegree (outdegree) distribution, and W is the maximum indegree (outdegree). 
Another straightforward example is representing IP traffic of a communications network, where k is a 
TCP flow, Sk is the set of TCP/IP packets that constitute flow k, and W is the maximum observable flow 
size. 

Sampling 

We observe (sample) elements of Sk, k = 1, . . . , m, with probability p - a process also known as 
thinning. Let a(Sk) be a random function that returns the number of observed elements of <S& . Elements 
are sampled independently (i.e., the sampling process is Bernoulli) and thus, 



P[a(S k )=j\S k = i] = 




j > 0,i > l,i > j, 
otherwise, 
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where q = 1 — p. We assume that when no elements of a set are observed, then the set as a whole is not 
observed, i.e., S k is said to be observable if a(S k ) > 0. Thus, we denote 

§ = {a(S k ) :a(S k ) > , k = 1, . . . , m} 

the size of the observable set sizes. Let N = |§| denote the number of observed sets. 

Estimation 

We start by considering p = 1, that is, all elements of all sets are observed. The minimum variance 
estimator of Qi is 

l{S k = i} 



T-(Si, . . . ,S m ) — 



N ' 

k=l 

where iV = m. To measure the accuracy of the estimates we consider the mean squared error (MSE) - 
a.k.a. quadratic loss - of the estimates 

MSE^'OSx, ...,S m )) = E[(T-(S h ...,S m )- 9 t ) 2 } = ~ ° i] < -L. 

m 4m 

Thus, for p — 1 the estimation error decreases as 1/m, recalling that m is the number of sets. 

Unfortunately, accurately estimating 6 when p < 1 is significantly more challenging. Recall that a set 
<Sfe is said to be observable if a(Sk) > 0. We upfront assume that a unobservable sets cannot be used in 
the estimation process. This means that our estimator only has access to sets S k where a(S k ) > 0. Here 
we need another function T$ that takes the observed set sizes § as inputs and outputs an unbiased estimate 
Tj(§) of 9i, i.e., £J[T»(S)] = Q{. In what follows we focus on unbiased estimates; our discussion section 



(Section VI) extends our results to biased estimators. The Mean Squared Error (MSE) of our estimator is 

MSE(T i (S)) = E[(T l (§)-^) 2 ]. 
The function Tj that minimizes the MSE with respect to sets of size % = 1, . . . , W is 

T*(S) = argminMSE(Ti(S)), 

Ti 

s.t. E[T?(§)} = 6i. 

IV. Results 
In this section we present and discuss our results. 

Theorem 4.1: Let 6 = (9i, . . . , 6w) be a distribution where 3i such that 0$ < 1/2 for all i > i . Recall 
that N < m is the number of observed sets out of the total m sets. We show that, as W — > oo, for N 
sufficiently large any unbiased estimator Tj(S>), i > 1 is such that: 

1) When 6 W decreases faster than exponentially in W, i.e., - \og6 w = tu(W), MSE(Tj(S)) = 0(1 /N) 
for < p < 1. 

2) When 9 W decreases exponentially in W, i.e., \og6w = Wloga + o(W) as for some < a < 1, 

a) log[MSE(Ti(S))] = Q{W/\ogN), if p < a/(a + 1), 

b) MSE(T,(§)) = Q{W 2i+1 /N), if p = a/ '{a + 1), 

c) MSE(Ti(S)) = 0{l/N), ifp> a/ '{a + 1). 

3) When 6w decreases more slowly than exponential, i.e., — \og6w = o(W), 

a) log[MSE(Ti(S))] = Q{W/logN), if p < 1/2, 

b) MSE(T,(§)) = 0(1/N), ifp> 1/2; more precisely, 

i) MSE(T,(§)) = u(l/N), ifp = 1/2 and T^J^s = 

ii) MSE(T;(S)) = 0(l/N), if either p > 1/2 or p = 1/2 and Y% =1 j 2i 6j = 0(1). 
Theorem 4.2: The bounds on the estimation error of the average set size are analogous to the set size 

distribution bounds. 



In what follows we explain how we sketch out the proof of Theorems 4.1 and 4.2 and describe their 
implications. 
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A. Lower Bound on Estimation Errors 

In this section we derive a lower bound on the Mean Squared Error (MSE) of Tj(§), i — 1, . . . , W. For 
this we use the Cramer-Rao (CR) lower bound of Tj(S), which gives the smallest MSE that any unbiased 
estimator Tj can achieve. 

Recall that a set is observable only if one or more of its elements are observable. The probability that 
a (random) set S is observed and has j elements is defined as 



i-3 



b Jt (p) = P [a{S) = j | a(S) > 0, \S\ = i] = ^— , if < j < i < W, (1) 

and = otherwise, where q = 1—p. Let dj(0,p) denote the fraction of observed sets with exactly 

j observed elements. From ([T]) we have, j = 1, . . . , W, 

d j (0 } p) = P[a(S)=j\\S\>O} 



P[a(S) = j\a(S) > 0, |«S| = i]P[\S\ = i\ a(S) > 0] 

i=j 
w 

J2bji(p)U9)- (2) 



i=j 



where 



UO) = P[\S\ = i | a(S) > 0] = f ] (3) 

is the distribution of the set sizes of the observed sets. Or, in matrix notation, 

d(0,p) = B(p)(l>(0), 

where d(6,p) = (di(6,p), . . . , dw(6,p)) J and B(p) = [bji(p)),j, i — 1, . . . , W. To illustrate the distribu- 
tion d(d,p) in our model, note that for a random observed set S, 

a(S)~d(0,p) 7 

with likelihood function 

f(j\0) ee P[ a (5) = j | 0] = {B{p)4»{e))i = d^Olp). (4) 

In what follows for simplicity we denote dj(0,p) as dj(0), j — 1, . . . ,V^. 

Recall that we are interested in functions Tj(§) that take as input the observed subset sizes S and outputs 
an unbiased estimate Tj(§) of 0j, z = 1, . . . , W. Moreover, we want these estimates to be accurate, i.e., 
MSE(Tj(§)) must be low in respect to Otherwise, the estimate is of little use to the practitioner for 
set sizes of interest, as illustrated in Figure [T] 

Thus, it is important to find attainable lower bounds of MSE(Tj(§)). The Cramer- Rao Theorem states 
that the MSE of any unbiased estimator T is lower bounded by the inverse of the Fisher information 
matrix divided by the number of independent samples N, provided some weak regularity conditions 
hold [9, Chapter 2], i.e., 

MSE(T 4 (§)) ee E[(Ti(S) - O,) 2 } > ^ ^ Jn , 1 < i < W. (5) 

where ( J^ip))^ 1 is the inverse of the Fisher information matrix of a single observed set defined using 
the likelihood function Q as 

(j(0) (r))) - y 9 In f(j | 0) d In f(j | 0) _ ^ dd^jO)) ddMW) 1 
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g iven Ei=i ^ = L 

The lower bound in Q is known in the literature as the Cramer-Rao lower bound or CRLB for short. 
Let T*(S) be an unbiased estimator, % = 1, .... We say T*(S) is asymptotically efficient if MSE(i;*(§)) 
approaches the Cramer- Rao lower bound in ([5]) as N — > oo. We show in Appendix |D"| that the Maximum 
Likelihood Estimator is asymptotically efficient on the set size estimation. The implication of having an 
efficient estimator is that the lower bounds provided in this paper are tight for iV sufficiently large. In 
what follows we represent J^ e \p) as for simplicity. 



B. Obtaining the CRLB 

In what follows we derive closed-form lower bounds for the MSE of any unbiased estimator T, as a 
function of the original set size distribution 6, the sampling probability p, and the number of observed sets 
N, where we ignore the constraint Y^iLi 0* = 1- Deriving a closed-form solution for the inverse of 
is no easy task as matrix is a function of df(j\0)/d9i, i — 1, . . . , W, which makes jw a non-linear 
function of 6. However, observe that the likelihood function f*(j\<f>) = P[a(S) = j\<p] (where S is a 
random observed set) is linear with respect to <p 



as J(*) = [J>f],i, k = 1, . . . , W, where 



It is worth noting that f*(j\<f>(0)) = f(j\6). The Fisher information matrix with respect to 4> is defined 

sre 

^ _S^ dd A<t>) ddj(cf)) 

given Ei=i 4>i — 1; an d because dj(0) is linear in 0, combining (|7]) and ([8]) yields 

= y 3 ( J9 )- 1 diag(y3( J 9)(73)- 1 (y3(p)- 1 ) T - <r3<r3 T (9) 

Here the term 00 T corresponds to the accuracy gain obtained by considering the constraint El^-i 0« = 1 
(see Tune and Darryl [[8j for more details and Gorman and Hero [[3j for the general formula on adding 
equality constraints to the CRLB). Quantitatively we can safely ignore the constant term 00 T as we are 
interested in the behavior of (J^) -1 as a function of W and the elements of cjxp 1 are typically small. 
All that is left to do is to find a relationship between (J^) -1 and (J^) -1 . 

We now obtain (J^)^ 1 from (J^)^ 1 through a multi-variate extension of the single variable chain 
rule. As f*(j\<p(0)) = f(j\0) the chain rule yields 



df(j\e) dffaWi)) dfih) d^(d) 



d6i d6i d6i 



Using the Jacobian VH = [h ik ], h ik = d8 k ((p)/d(f)i with 8 k ((p) as given in ([3]), we arrive at the equivalent 
multivariate rule [9, pp. 83] to express (J 1 ^) -1 as 

= Vi?(J (<A) )- 1 Vif T '. (10) 

Using ([9]) - detailed derivation relegated to the Appendices - we find: 

[tf* ) )- 1 h= E (^ 2 7-)(-)(- 1 )^'(^- 1 )(^- 1 KW- dD 

fc=max(i,j) \J/ \ / 
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Substituting ( fTT| ) into ( |T0| ) - and through a variety of algebraic manipulations detailed in the Appendices 
- yields 



[(JW)- 1 ] 



1 



w w 



i] 2 \ (1 — q 



i\2 



[(^ ) )^+^ 2 EEti 



where = V^i /(l — q^). Note that term of @ is proportional to the CRLB of 0, [(J^)" 
but terms A 2 (i) and A 3 (i) are more involved. Through a series of algebraic manipulations of terms A\, 
A 2 , and A3, all detailed in the Appendices, we see that (Ai(i) + A 2 (i) — A 3 (i)) grows as a function of 
(1 —p)/p and W, yielding the relation 



(12) 



MSE(T;(§)) = Q 



v 







\ 



\ 



N 



(13) 



/ 



where the number of observed sets N is large but constant in respect to W. 



The result in ( [13] ) is very powerful as it gives simple estimation error lower bounds as a function of the 
sampling probability p and the original set size distribution 0. A close look at ( |T~3| ) reveals - a detailed 
exposition is presented in the Appendices - that when ((1 — p)/p) l 6i = Q^ 1 ) for all i > i*, i* <C W, 



then the sum in ( [13] ) grows at least as fast as the a harmonic series, which grows as log W. On the other 
hand, we see in the Appendices that when ((1 —p)/p) l 9i = 0(i -/3 ), /3 > 1, then the sum in ( [13] ) converges 
to a constant, more precisely, it grows no faster than a Riemman zeta function with parameter (3, ((/3). 

Thus, for a given 6 with W ^> 1 the CRLB suffers from an interesting sharp threshold related to 
the sampling probability p. If p is below this threshold no estimator Ti of 8i ,i = 1, ... , W, is able to 
achieve accurate estimates of 4 . Below such p threshold, and as long as the number of sampled sets, N, 
is large enough, there exists estimators Tj(§) ,i = 1, . . . , W, that can achieve accurate estimates. To be 
more specific, we look at the threshold behavior of p by breaking down into three broad classes of 
distributions: 

1) If Ow decreases faster than exponentially in W there is no threshold behavior of p. This is because 
if — \0g6w = w(W), then there exists a constant a < 1 such that ((1 —p)jp)^Qj < a\ j = 1, 2, 



Hence, the sum in ( p"3] ) converges to a constant for any p > 0, yielding MSE(Tj(§)) = 0,(1 /N), for 
< p < 1. Detailed arguments are presented in the Appendices. 



2) If \og6 w = W log a + o(W) then if p < a/(a + l) yields ((l-p)/p) j 6 j 



Q(l), Vj. Hence, 

the sum in ( [T3] ) diverges with W. On the other hand, if p > a/(a + 1) the sum in ( [T3] ) converges 



3) 



to a constant. Detailed arguments are presented in the Appendices. 

Finally, if 9 W decreases more slowly than exponential then if p = 1/2— e, e > 0, yields ((1—p) jp)i > 
(1 + e/2) J , Vj. Hence, because 9j decreases more slowly than an exponential, the sum in ( [T3| ) 



diverges with W. If p > 1/2 the lower bound in ( [13] ) converges to a constant. Detailed arguments 
are presented in the Appendices. 



To illustrate our results, we compute the MSE lower bounds in ( fT2[ ) where is the Enron in-degree 
distribution truncated at different values of W. More precisely, we take the in-degree distribution of the 
Enron dataset (discussed in Section [II]) and truncate the maximum degree to W by accumulating in W all 
the probability mass previously corresponding to degrees greater than W . The Enron in-degree distribution 
is a (truncated) heavier-than-exponential distribution. 
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(a) p = 0.25 (b) p = 0.90 

Figure 2. CRLB of the in-degree distribution of the Enron dataset for N = 10 4 samples. 



Figures [2^ and [2J3 show the MSE lower bounds for p e {0.25,0.90}, respectively. We observe that 
for p = 0.25 (Figure ETa)) the MSE lower bound grows with W even for small degrees, as predicted 
by Theorem |4.1| While, for p = 0.9 (Figure ^b)) the MSE lower bound behaves (mostly) independent 
of W, also as predicted by Theorem 4.1 These results corroborate to explain the simulations results in 
Section M 

Other metrics besides the set size distribution are of interest. In what follows we observe that, surpris- 
ingly, the accuracy of the average set size follows similar lower bounds of set size distribution estimators 
, W. We then analyze the accuracy of entropy estimates. 



T u i = 1 



V. Accuracy of Estimated Averages 
In this section we focus on the accuracy of the average set size. 



A. Average set size 

The average set size is m e = ^2j =1 j&j, or, alternatively, in matrix form 

m B = [l,...,W]0 T . 

Let 

[i,...,w]. 



dme 



00i 



36 



w 



Let m(S) be an unbiased estimate of the average set size. Using a similar argument used to obtain ( fT0| ) 
(see Appendices) yields 



MSE(m(S)) > ^(JW 



_ X VM T 



ve 



VM fVH^^VH^ VM T 



(VMVH 



V0 J V9 
_! (VMVH\ J 

WW 



(14) 
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Note that 



VMVH 

WW 



w 



E« 

i=l 
W 

E 

i=l 



ik 



0i 



77(1 - g fc ) 



1-4 
77(1 -q k ) 



v (i- 

k — nig 
77(1 -q k Y 



(15) 



where again 77 = £),- =1 4>j(Q)/(^ ~ Substituting ( [T5| ) into ([14]) yields 



MSE(m(§))>-^^ 



j -m e 



i=i j=i 



77(1 - g») 



1 1 

w w 



^^(l-qj)(l-q 



77(1 — 



•" J- ™ 2 [(^ ) 



^(i-^'XW) 



2m W J " JW >'' 



1 1 



ir 



i=i 



Tft 

-J-A 2 (i)-2m e r)(mo + -6 1 ) 



with A 2 (i) as given in ( fT2] ). Detailed derivations are found in the Appendices. A closer look at A 2 (i 
reveals 



1 

N 



Of [i + v 



w w 
0=1 i=i 



P 



n 



6, * 



N 



V 



(16) 



/ 



W, in (13). 



Note that the lower bound of m(S) in ( |T6| ) is the same as the lower bound of Tj(§), z = 1, 
Hence, a theorem in the lines of Theorem 4.1 can be stated for m(§): 

Theorem 5.1: Let = . . . , be a distribution where 3i such that 6i < 1/2 for all i > i . Recall 
that N < m is the number of observed sets out of the total m sets. We show that, as W — > 00, for 
N sufficiently large any unbiased estimator of the estimated mean of 6, m(S), must obey the following 
properties: 

1) When 9w decreases faster than exponentially in W, i.e., — log#^ = u(W), MSE(m(§)) = 0(1 /N) 
for < p < 1. 

2) When 9 W decreases exponentially in W, i.e., log^iy = Wloga + o(W) as for some < a < 1, 

a) log[MSE(m(S))] = Q(W/ log N), if p < a/(a+l), 

b) MSE(m(S)) = n(W/N), ifp = a/ '{a + 1), 

c) MSE(m(S)) = 0(1 /N), if p > a/ (a + 1). 

3) When 8w decreases more slowly than exponential, i.e., — logOw = o(W), 

a) log[MSE(m(S))] = Sl(W/ log N), if p < 1/2, 

b) MSE(m(S)) = 0(1/ N), if p > 1/2; more precisely, 

i) MSE(m(S)) = u(l/N), if p= 1/2 and J27=ij% = "(1), 
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ii) MSE(m(S)) = 0(1/N), if either p > 1/2 or p = 1/2 and Eji J 2 #j = 0(1). 

Theorem |5T states that estimating the average set size is in the same order of hardness as estimating 
the entire set size distribution. 

It is interesting, though, to verify if the same property holds in the case of the average size of the 
observed sets, i.e., the average set size in respect to <fi, 

w 

3=1 

In what follows we show that the difficulty in estimating is a function of W and is affected only by 
the first and second moments of <f>, that is, as long as and 



(2) 

m 



w 
i=i 



are finite, can be accurately estimated if enough samples, N, are collected. 
Let rh^(E>) denote an unbiased estimate of and let 



MSE(m^(§)) = E[(m^(S) - 



rriH 



denote the MSE of m^(S). After applying a variety of algebraic manipulations detailed in the Appendices 
we arrive at the following inequality 



MSE(m^) > 



(i,...,w)(j^)-\i,...,wy-mi 

N 

EEE,-C)(^<^, (1 -«« 



k=i t=i j=i 
w 



2s p (i _ qi) n*)/*- 



,i=i 



More interestingly, we show that 



*;<s> = %f + (i - 1) (i7) 



is an unbiased efficient (minimum variance) estimator of m^, yielding 

w 



MSE K(S )) = ( £ «(* + ^ = g + «)* - < ) IN. 



Alternatively we can rewrite the above as 

MSE(m ) = O 



(2) 2 
m 4> ~ m 4 

N 



Hence, MSE(m^) is lower bounded by the variance of the observed set sizes. A simple explanation for 
this behavior is likely found in the inspection paradox. Even if we know the sizes of the sampled sets, the 
mere fact that the set is sampled means that it probably has a higher than average size, as the probability 
that a set of size i is sampled is 1 — (1 — p) 1 . Larger variance in the set sizes means larger biases towards 
sampling larger sets, which in turn makes it harder to unbias these samples. 
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VI. Discussion 



We divide this section in three parts. Section VI-A considers the initialization of estimation procedures. 
Section VI-B shows that no clever way to process the data S exists that would allow an estimator to 
violate the bounds provided in Section IV Finally, Section VI-C shows that our results can be extended 
to encompass biased and Bayesian estimators. 



A. Initialization of Estimation Procedures 

As previously stated, eq. ([4]) can be used to derive a maximum likelihood estimator (MLE) for 6. From 
the MLE one could either use a constrained non-linear optimization method to maximize the likelihood 
function directly or use the Expectation-Maximization (EM) algorithm to write an iterative estimation 
procedure. In the latter case, the procedure consists of an initialization step followed by a loop of two 
steps known as the E-step and M-step. We discuss two issues that arise when EM is used to estimate the 
set size distribution. 

In EM, the solution to which the algorithm converges to depends on the initial guess. Therefore, 
in order to have an unbiased estimate, one must choose a point uniformly at random from the space 
of possible values. Although it may seem reasonable to choose values for each uniformly in [0, 1] 
and then normalize them, it turns out that this does not yield uniformly distributed initial guesses. One 
way to correctly generate the initial guess is to draw from the Dirichlet distribution with W parameters 
a = (1,...,1), since the Dirichlet PDF at point 6 is proportional to Yli=i ®T~ X ■ 

Nevertheless, such an initialization combined with the other two steps of EM will give us estimates 
8i E [0, 1] hence producing biased estimates. Therefore, it is possible that EM achieves an MSE not in 
agreement with the CRLB we derived previously. This is the case when the number of samples N is 
small and, consequently, the diagonal of G has relatively large values (possibly greater than 1). On the 
other hand, for large N, the number of observed sets with size i will converge to a Normal distribution 
with mean 9i and small variance. For small enough variance, restricting 9i to be between and 1 does 
not affect the final estimate significantly and thus the CRLB accurately bounds the MSE. 



B. An Application of the Data Processing Inequality 

The data processing inequality [flOl states that no function of the data may increase the amount of Fisher 
information already contained in the data. Thus, the bounds in Theorems 4.1 and 5. 1| remain unchanged 
regardless of how the data is pre-processed, no matter how clever the pre-processing approach is. This, 
of course, encompasses any type of noise filters or machine learning methods. 



C. Impact on Different Types of Estimators: Bayesian, Frequentist, Biased and Unbiased 

To extend our results beyond unbiased estimators we explain the connection between Fisher information, 
the Cramer-Rao bound and biased estimators. We also extend our results to Bayesian estimators (including 
maximum a posteriori estimators). 

I) Extension to Biased Estimators: Let h{9i) = E[Ti(S)] — 9{ be the estimator bias. Then (see for 
instance Ben-Haim and Eldar (TJ) 

2 



mse(ihs)) > (i + ^) [(jW)- 1 ; 



assuming db(9i)/d9i exists. Note if the bias derivative satisfies —2 < db{9i)/d9i < 0, then the biased 
estimator has lower MSE than any unbiased estimator. H owever , we believe it is unlikely that a large 



value of [(J^) l ]a (as large as 10 160 as seen in Section IV-B for the Enron e-mail network) can be 
compensated by a biased estimator. 
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2) Extension to Bayesian Estimators: Let 6 now be a random variable with prior distribution ixg. A 
Bayesian estimator adds Tig as extra information to the estimation problem. The Fisher information of the 
prior is 

d In 7T0 d In iig 
~d6i d9~ _ ' 

The Fisher information obtained exclusively by the data is presented in ([6]). And the total Fisher 
information prior + data is |9[ pp. 84] 

JM = J(P) + J(°). 

The Cramer-Rao bound of a Bayesian estimator Wj(§) of 9i with prior 7re yields |9| pp. 85] 

MSE(W 4 (S)) > (j^y 1 = (J (p) + JW)" 1 , 

and thus, if the data contains little Fisher information then a decrease in the MSE is due to the information 
contained in the prior n g . 

VII. Conclusions & Related Work 

In this paper we give explicit expressions of MSE lower bounds of unbiased estimators of the distribution 
of set sizes 9 and the average set size mg with sampling probability p. We show that the estimation error 
of 6 grows at least exponentially in W, when \og9 w = W \oga + o(W) as W — > oo for some < a < 1, 
and p < a/ (a + 1), or when log 9 W = o(W) as W — » oo and p < 1/2, which indicates that there unbiased 
estimators of some distributions 6 are too inaccurate to be useful for practitioners. Moreover we show 
that unbiased estimates of m« suffer from similar problems. 

Not much prior work exists in the literature. Hohn and Veitch [4| first observed that using a sampling 
probability of p < 1/2 poses problems in the context of two specific estimators for the flow size distribution 
when the distribution obeys a power law. In particular, they showed that their estimators are asymptotically 
unbiased with decreasing error as the number of flow samples increases when p > 1/2 but not when 
p < 1/2. Our work shows that this is a fundamental result of set size distribution estimation and not 
specific to any one or two estimators. Ribeiro et al. [|7J was the first to introduce the use of Fisher 
information as a design tool for flow size estimation. Experiments reported in that paper suggested that 
there is little information when p is small and showed how this information can be significantly increased 
with the addition of other data taken from packet headers. Last, Tune and Veitch [[8j applied Fisher 
information to compare packet sampling with flow sampling. In the process of doing so, they obtained a 
variety of useful Fisher information inverse identities, which we rely on in this work. 
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Appendix A 
Set size distribution proofs 

Let B{p) = [bji(p)],j, i = 1, . . . , W be a matrix whose elements are given by 



bjiip) = P [a(S) = j | a(S) > 0, \S\ = i] 



»-3 



1 - q l 



if < j < i, 



(18) 



and bijip) = otherwise, where q = 1 — p. 

Lemma A.l shows a closed formula for the inverse of B(p). 
Lemma A.l: B(p)^ 1 = (i, j — 1, • • • , W), where 



Qp-'i-qy-^l-q*) i>3 
i < j. 



Proof. Let B(p) 1 = with b^p) defined above. We first show that Y = B{p)B(p) 1 is an identity 

matrix. Consider element (j, i) of Y: 



(19) 



i=i 



We have three cases: j > i, j = i, and j < i. 

Case 1, j > i: eq. (19) yields = since bji(p) = 0, V/ < i and b* u (p) = 0, V/ > i. 
Case 2, j — i: Here bji(p)b^(p) = 0, V/ ^ j and (19) yields 



1 - q3 



Case 3, j < i: eq. ([19]) yields 



Vji = J2^y ly ' ''I 



1=3 



DC 



' — J 



p? % q 



1-3 



1=3 V J 



= pi-'q'-i 
= 

Thus, yjj = 1, Vj and yji = 0, Vj ^ i, which concludes our proof. 



□ 
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Lemma A.l directly yields the inverse of the Fisher information matrix of a single observed set, 
as seen in the following lemma. 

Lemma A.2: (J^)- 1 = [[{J^YW = 1, 2, . . . , W), where 



[(J^)- 1 ]* = E (-) 2k ( k )( k )(-ir- 3 (^-m^ -l)d k (0) (20) 

fc=max(i,j) VP/ WW 

Proof. Denote i#*)(p) = [i2j-f(p)] = 5- 1 (p)diag( J B(p)0)- 1 , where i*Jf(p) = 6£(p)dj(0). Based on 
Lemma |A.1| and eq. ([2]), we have 

J#>(p) = | C)P" i (-?) < " i (l - » > * ( 2i) 

Since jw = i?(^)(p)(i?(p) _1 ) T , [(J^) -1 ]^ is computed as the following equation based on Lemma 
and eq. ((21) 



A.l 



fc=i 

= L 



fc=max(i,j) ^ 
W / \ 2k 



(S G)(-)(- i )" , " i '('" , - i )('" i - i ) d 'W 



fc=max(i,j) 



□ 



Lemma A.3: = [[( (i, j = 1,2,...,W), where 

K j J ""^ 2 V (W) a ^fea-^a-^) 4 ^(i-9*)(i-^V 

where 77 = X^=i - 

Proof. The relationship between (J^) -1 and (J^) -1 is given by 

(jW)- 1 = VH(jW)~ l VH T , (23) 

where = [h ik \ with = d6k(<f>)/d(f)i. Hence 

- S 1-^/(^(1-9')) ,• _ L. 

where 77 = XlfcLi 0Jfc/(l ~~ <Z fc ) is a constant. Note that from eq. ^ we have 9{ = <f>ij{rj{l — q 1 )). Therefore 
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the diagonal elements of (J^) 1 can be written as 

w w 



j=i k=i 

w w 

EE 

3=1 k=l 
W 



77(1 — g A 



ft 



77(1 - g?) 



+ 



E (^y) tt^-i. (- 



ft 



3= 



E 

*s=i 



ft 



77(1 — g A 



[(jW)- 1 ; 



77(1 - qJ) 

1-ft 
77(1 - g*) 



+ 



1-ft 
77(1 - g*) 



[(JW)- 1 ] 



( ga 



(W) 2 l ^ 



-(l-^)(l-gi) 



We split eq. ( |22| ) in three parts to carry out its analysis: 

w w 

I I I /'■.''! I ■'- I 

[(JW)- 1 



7/ 2 I (1 — g*) 2 



2 ft £ 



Mi) 



(24) 

□ 

(25) 



A. Analysis of A\{i) 

Based on Lemma |A.2| and eq. ([2]), we have 



Lemma A.4: 



W-i 



3=0 



where 77 = Y%=i Ml ~ 9*) and 9ij = ELo (?) © 
Proof. 



[(JW)- 1 ] 



E(p) (■) (-»-v-D ! «) 



k\ ( j -i 



q J * — ' V i J \k — i 

k 



(q/pY 



(<T ~ I) 2 E 



i=o 



+ j\ <l" J Q; j!l; l 
i ) 1 - g l +J 



where ^ = ELo.m(D(^) 

Since — g l ) = ft • 77, we can eq. ( j26] ) as a function of 



(26) 



(27) 



W-i /• , -\ 

K^)- 1 ]* = v (<T - 1) 2 E ( z yy +j o i+m - 



3=0 

Therefore 

W-i 



i=o V Z 7 



Lemma A.5: We have the following bounds for Ai(i): 

i oo 

A 1 ( i )<C l Y,c lk Y^ 1 {k<m+j) 2t {-T +J 9 



k=0 j=0 



i+j 

p. 



and 



where 



and 



W-i 

A 1 {j)>c l c ii £ .r'( 7 )"^,, 



i!) 2 



(.\ r — A; — 1 



Proof. Since the z-th derivative of (q/p) l+k with respect to q/p, is 
we have the following equations for 

^ fc=0 i=l v 7 

1 ^iv 1 - fj\d i (g/p) i+k 



( q -yy( j ) 

i\ y \k) d( q / P y 



l (i)l d l (ELo@(^) 



i+k 



i\ K p d(q/pY 



! <T^( ff /p)»(l + q/p)' 
i\ ^p d(q/pY 
Using a general form of the product rule (6} pp. 318] yields 

min{ij} , . fc_l i-k-1 

*=^)* e (;)(irn«-o(f)'nc- 

where to simplify the expression we define fljlo ' ' ' = 1- 



Substituting ( |3T| ) back into ( |28| ), we obtain the following expression for A\(i) 

i W—i i k—1 

A 1 {i) = aj2 c ik ^ # no + o - l )(i/p) i+je i 

fc=0 j=0 J=l 

where 



z=o 



a 



(i\y 



and 



i-fc-1 



II (*-0, fc = 0,...,i;i = l,...,W. 



z=o 



We have the following upper bounds for Ai(i), 



W-i 



A l (i) < Cl j2c lk j2nk<m+j) 2i (-y +3 9 



k=0 j=0 

i oo 



P 



i+j 



i+j- 



k=0 j=0 

A lower bound is obtained by noting that 

i k—1 k k 



1=1 



1=0 



1=1 
k 



=1 



i=i 



The latter is greater than or equal to j whenever j > i(i — 1) yielding 

W-i 

A 2i('i\ i +0 l 

'i+j- 



A 1 (i)>c i c ii Y .r'( 7 )"^ 



j=i(i~i) 
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B. Analysis of A 2 (i) 



w w 

EE 



[(jW)-i] 



^(i-^Ki-^) 



First, note that 



Also, 



tv 

E 

fc=i 



P 



TV TV W ( k } ( k \ [ 1 



EEE 

i=l i=i fc=i 



E© a ^)EEG)G:i(- 5 )- 



fc=l 

TV / n 2fc 



fe=i 



i=i i=i 
it 



E(!) 2 '*wfE ff )(-«)■ 



»=i 



using (63) 



tv ✓ x fc iy/x2fe 

fe=i fe=i ^ ^ fc=i 



TV ✓ xfc W 7 / \2it 



4(0) 



TV / \ fc TV / .\ 

E i E 6>v-%, 

fe=i v ■ p/ j=i v 7 



ir 



i=i fe=i v 7 

TV 

—r^y^dj. using ( [65] ) 



(36) 



(37) 



TV , x 2A 
E ( p 



TV / v 2k W , .x 

*(*) = EU) Er>v-^ 

fc=i V - F/ i=i v 7 



IT' 



J / -\ / \ k 

3 \ I 1 



"E^E W Vp 

j=l k=l v 7 yF 

(0 

' TV / v j TV 



using (64) 



(38) 



Replacing eqs. (37) and (38) into (|36|) yields 



i=l j= 



(W W , s j w 

2 E^ + E(;) ^-E 

1+ ,(e^ + e©4 



Therefore, 



( ( W W / \ 3 



Note that A 2 (i) is positive and may diverge or not depending on the summation Y^j=i 



C. Analysis of A 3 (i) 
Note that 



We also have 



EC)(-f)E(^ 



w 



3\ 3 — % 



i ) \ k — i } ' J 



<iE<-u'E 

k=i j=l 

iE(9^Eg::)c-i> 

j=i v 7 fc=0 v 



[-qYvQi- usin g (66i 



W /7 N / \ 2fc 



From eq. (42) and (43), we have 



eO© eGK^ 

,E©E(0(r-;)^ 



k — ij \p 



- (l-gi)(l-gO 



^-(-^EnV%Eu: A/ ' 



k=i 



k — ij \p 
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and hence, 

W- i ,. .x j / \ / \k+i 

A 3 a) = 2^1 - wi-^v J2{ V) qi+ie ^ E (I) (j) • (45) 



, j=0 x 7 fc=0 

^3,lW S _ ^_ 

^3,2 (i) 



Since A 3) i(i) is always finite, we only need to compare the magnitude of A\(i) and A 3>2 (i). Since 

£i=o (3 (5) < ft* we can bound b y 

i=o V * 7 

W-i /• , \ 

A!(<) - |a 3 , 2 «I > (<r 24 - 2^)77 £ . 3 jq i+j e i+jgij . 



Therefore 



j=0 

The RHS of the previous inequation is positive when 



9, < * <i. 

2q l 2 

Recall that we assumed that 3i such that 9^ < 1/2 for all i > i . Thus by examining only Ai(i) and 
A 2 [i) we can determine whether diverges or not for i > i . 

Appendix B 
Proof of Theorem |4JJ 

The lower bound of MSE(Tj(§)), given by is described for each of the three possible cases 



in Theorem 4.1 The corresponding proofs are shown in what follows. 
1) When 9 W decreases faster than exponentially in W. 

Proof. Suppose that 9w decreases faster than exponentially in W . More precisely, assume that 
— log^vy = u(W). It follows that \og(9 w /9 w+ i) — > 00 as W — > 00. Hence, for any e > 0, there 
exists a W (e) such that \og{9 w /9 W+1 ) > 1/e for W > W (e). This implies 9 W+1 /9 W < e~ 1/e for 
W > Wo(e). Given p > 0, we can choose e such that qe~ 1 / t /p < 1. We now apply the ratio test for 



convergence of an infinite sum to each of the i + 1 sums in the upper bound for A\(i) given by (29) 



(W + % + l) 2i (q/p) w+i+1 9 w+l+1 (W + % + l) 2i qe-^ e 
(W + iy i (q/p) w + i 9 w+i < [W + if* p 

for W > Wo(e) — i and the latter expression becomes less than one as W — > 00. Hence Ai(i) = 0(1) 
for < p < 1. 

A similar argument can be used to show that A 2 [i) = 0(1). Hence, = 0(1) for < p < 1. 

□ 

2) When 9 W decreases exponentially in W. 

Proof. Suppose that 9 W decreases exponentially in W. More precisely, let log 6^ = Wloga + o(W) 
for < a < 1. Recall that A 2 [i) is positive. Therefore, the logarithm of [(J^)^ 1 ^ in ( [22] ) can be lower 
bounded as follows, 

logpW)- 1 ],, > log^(i). (46) 
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In addition, the logarithm of A\(i) in ([26]) can be bounded by 

log A x {%) > W \og(q/p) + log 9 W + o(W) 
= W \og(qa/p) + o(W) 

where the latter equality follows from the hypothesis. Now, if qa/p > 1, then logAi(i) = Q(W), which 
implies log[( J^)" 1 ]^ = Q(W). Note that qa/p > 1 iff p < a/(a + 1). 

When p = a/ (a + 1), then qa/p = 1. Hence the lower bound of Ai(i) given by ( [30] ) is VL(W 2l+1 ). 
Hence, \{J {e) )-% = Q(W 2l+l ). 

Similarly to the proof for the case where 9w decreases faster than exponentially in W, we can use the 
ratio test for convergence of an infinite sum to show that for qa/p < 1, A\{i) = 0(1). Hence, it follows 
that [(J {e) )- l )u = 0(1) for p > a/(a + 1). 

□ 

3) When 6 W decreases slower than exponentially in W. 

Proof. Suppose that 9w decreases slower than exponentially in W. More precisely assume that 
— log 9w = o(W). The logarithm of Ai(i) can be lower bounded as follows, 

log A 1 (i) > W \og(q/p) + log 9 W + o(W) 
= W\og(q/p) + o(W) 

The latter equality follows from the hypothesis. Now, if q/p > 1 (i.e., p < 1/2), then logAi(i) > Vt[W), 
which implies log[{J^)-% = Q{W). 

When p > 1/2, it follows that A 2 (i) = 0(1). In particular if p = 1/2 and J2%ij 2l °j = we can 
see from eq. (j30j) that A x {i) = and in turn, [(J^) -1 ]^ = 

Note that for p = 1/2 each of the % + 1 sums in the upper bound for A\{%) given by ([29]) is bounded 
by the 22-th moment of the set size distribution. Hence, if Y^=i3 2i ^j = 0(1), then [(J^) -1 ]^ = 0(1). 

Finally, when p > 1/2, an argument similar to that used in the case where 9\y decreases faster than 
exponentially yields [(JW) _1 ]« = 0(1). □ 



Appendix C 
Simplified bounds 

It is worth noting that A 2 (i) gives us a lower bound on [( J^) -1 ]^, as Ai(i) — A 3 (i) > 0. Furthermore, 
the convergence of A 2 (i) is given by the convergence of the sum Y^^q/pYOj. Therefore, we can write 



K^)" 1 ]* = n V L —Z )e j . (47) 




From that, we derive the following results. 

1) When 6 W decreases faster than exponentially in W. 

By definition, for any e > 0, there exists a W (e) such that \og(6w/0w+i) > 1/e. Given p > 0, we can 
choose e such that qe~ 1 ^ e /p < 1. The ratio test for convergence of an infinite sum reads 

(g/p) J+1 flj+i < Q^l (4g) 
(q/p) j 6j p 

Let a = qe~ l l e /p. Hence, there exists a j* such that for all j > j*, ((1 — p)/pY6j < a\ j = 1, 2, 

Therefore, the sum converges to a constant for any < p < 1, yielding [(J^) -1 ]^ = 0(1). 
2) When Q w decreases exponentially in W. 

By definition, there exists < a < 1 such that log 6^ = W log a + o(W). When p < a/(a + l) it follows 
that ((1 -p)/p) j 6j > a- j 6j = Therefore, [(J w ) -1 ]« = 0(W). A tighter bound can be obtained by 
taking into account A^i), yielding log[(jW)- 1 ]« = O(W) forp < a/(a + l) and [(J {e) y% = 0{W 2i+l ) 
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for p = a/(a + 1). On the other hand, for p > a/(a + 1), we have ((1 — p)/p) j 9j < a?9j = 0(1). Hence, 
[(jW)-i] = O(l). 
3) When 9 W decreases slower than exponentially in W. 

When p < 1/2, it follows that (1 — p)/p — a > 1. In this case, there exists a j* such that for all j > j*, 
((1 -p)/p) j 9j = a j 9j = Hence, = O(W) for p < 1/2. Conversely, when p > 1/2, 

(1 — p)/p — a < 1. Hence, there exists a j* such that for all j > j*, ((1 — p)/py9j = a^9j = 0(1). 
Thus, [( J^) -1 ]™ = 0(1) for p > 1/2. At last, for p = 1/2, the summation is exactly 1, which also 
implies [(J (0) )'% = 0(1). In the latter case (i.e., p = 1/2), a tigher bound is obtained by taking 
Axii) into account, which yields [(J (0) ) _1 ]ii = w(l) if £ j = l^j 2 ^ = w(l) and [(J^) -1 ]* = 0(1) if 

Ej = iV^ = o(i). 



Appendix D 

Asymptotic Efficiency and Asymptotic Normality of the MLE T*(§) 

In this section we show that there exists a Maximum Likelihood Estimator (MLE) T^(E>) of 0, that 
is asymptotic efficient (i.e., MSE(7 1 *(§)) = and asymptotic normal. Since the Delta Method 

is an exact approximation for the Normal distribution, it follows that there exists a MLE T*(§>) of 9i that 
is asymptotic efficient, which can be obtained by applying the Delta Method to T^(S). 

Consider the likelihood function in Eq. ([7]): 

w 

f{j\4>) = ^bji^i. 
i=i 

From the sum-to-one contraint on the parameters, it follows that <pi = 1 — 4>i- Thus we can rewrite 
the previous eq. as 

w 



i=2 



Hence, 



Al og/(j |0) = b g ^ 2 < k < w. 

From Theom. 5.1 [5, Chapter 5], we prove that there exists a MLE that is asymptotically efficient and 
asymptotically normal by showing that assumptions (A0)-(A2) and (A)-(D) are satisfied. 
Proof. (AO) Follows from ©. 

(Al) The support of 0, for 2 < i < W is < fa < 1 subject to Yn=2 0i < !■ 

(A2) Observations are assumed to be independent. 

(A3) Follows by the assumption that < fa < 1 for 2 < i < W. 

(A) We have 

<} f(j\4>) = b jk , 2<k<W 



90, 

and hence 

7^^4^-/010) =0, 2<k,l,m<W. 
d(p m d(pid(p k 

(B) The expectation of the first logarithmic derivative of / is 
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W h -h f W \ 

= Y\ w [bji + Ytibji-bjJfa) 

j= i b n + E l=2 (% - V i=2 / 

j=i i=i 
= l-6n 
= 0. 

As for the second derivative, we have 



Ej, 



d 

d<f) k 



log/ C7'|0) 



E 



d d 
■wr- log / ( j\ 0) j— log / (J 0) 



which is equivalent to 



j'=i (6,1 + y),- = ,(6« - 6,i)0j ) V »=2 



E 



(6ji-6ji)(6j fc -6ji) 



? 6ji + E l = 2 (6ji - 6ji)0i 



<9 2 



<90;<90 fc 



log/ 010) 



E 



(C) The vectors 



V V^ 1 

(bji - bji)(b jk - bji) 
t b n + E l = 2 (6ji - 6ji)0i 

5§- log/ W), gg- log /O|0), . . . , log /O10) 



ba + E,= 2 (^ - 6,1)00 V 1=2 / 



E 



independent with probability 1. Note that and b jk > 
It follows that for j > k > 2 



for 1 < j < W must be linearly 



j < k (in particular, bji > -<=>- j = 1). 



g 
<90 fc 



log/010) 



whereas for j < k, 



d 

d(/)k 



log/ 010) 



6jfc — 6ji 



& il + E l = 2 (6ji - 6,1)0; 

0, 



E 4 = 2 (6ji - 6ji)0i 



> 0. 



Therefore, the j — 1 leftmost entries in the j-th vector are while the remainder are positive. Hence 
the vectors are linearly independent. 
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(D) Consider a constant e, > such that f{j\4>) = bji + Y^ =2 {bji — bji)(pi > tj for 1 < j < W. Thus, 



<9 3 



d(j) m d(j)id(j) k 



-(b jk - bj^ibji - bj X ) x 2(b jm - 6,-i)0 m (6 il + E!= 2 (% ~ 



2(i> jA . - 6ji)(6j-i - &ji)(&jm - & 



< 



2(b jk - - bji){b jm - 6ji)0 r , 



f 3 



Since M Mm (j) 



a 3 



p m o<piU(p k 



f{j\4>) < oo, then ^[M Wm (j)] < oo for all k, I, 



m. 



□ 



Appendix E 
Average set size proofs 

Lemma E.l: Let p be the sampling probability and denote an unbiased estimate of the average size 
of the observed sets m^. Then, 



MSE(m ) = O 



(2) 2 
m 4> ~ m 4 

N 



Proof. The estimation error lower bound of the average set size is [|9l pg.83, Proposition 3] 



MSE(m^) > 



(l,...,W)(jW)- 1 (l,---,W) T -m; 
N 



Lemma A. 2 yields 



W k k 
k=l i=l 3=1 



»J I J) (*) (|) " (-l) 2 ^'^ - - 1)4(0) 



J](g/p) 2fc 4(0) X)" 



k=l 



k^\ q 1 — 1 



k\ q-i - 1 



w 



d 1 (<f>) + J2(Q/p) 2k dk(<l>) 



k=2 



1 — q\ k 



1-q 



)w 
y fc=i 



Now <|2]) yields 



i-1 



(50) 



(51) 



(52) 



i=l 
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and 



ir 



w w 



k=l 



Q 

k=l i=k ^ 
— 1 ( i )n k n i - k 



i=l fc=l 
W / i 



Q 



Using the relation 



yields 



i=i \fc=i 



Vr Uv-¥ = r' ■ i = 1, 

£-^\k) \ ix(ix + y)(x + y) % 2 , i > 2. 



ii- 



fAM,* ^ ipjip + q)(j)i 



k=l 



i=l 



Putting together (50), (51 ), and (p3J yields 



MSE(m ) > 2^ Zh Zt\ m <$> I l N 



i=l 



p(l - g*) 



which concludes the proof. 

Lemma E.2: Using the observed set sizes § = {Sk]^ =1 the following 



P 



L S fc =l 



iV 



is an efficient (smallest variance) unbiased estimator of m§. 
Proof. We start by noting that 

rrij, = [1, W}<fi = [1, W]B~ l d((j)). 



Denote z = [zi, . . . , z w ] = [1, W}B 1 . From Lemma A.l we have 



ir 



En 



3=1 



3=1 



-q/pYJ2 J 



i\ 1 — g J 



For z = 1 pi} yields = 1 and for 2 < i < W, 



Zi = (-q/p)'' 



1-q 



Therefore, 



q J 1 — q p 

\p,2,3,...,W] 
P 



(53) 



(54) 

□ 

(55) 



(56) 



(57) 
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Thus applying the above back into ( f56| ) yields 

m d 



m<t> 



V 



+ 1 



V 



di{4>), 



(58) 



where m d = Y^f=i ^ * s me expectation of average set size of observed subsets. Rewriting ( |58] ) using the 
set sizes § we get 



1 N 

-Y 

k=l 



1 



+ 1 

v V v 



Based on our assumption that {Sk}™^ is an i.i.d. sequence, we have that {<Sfc}fcLi is also i.i.d. with 
distribution d(<f>). Therefore, 



and 



Since 



and 



E[rh^\ = E 



Var[(m ) 2 ] = -Var 











. v 





v 



+ i 



L 5 fc =l 



£[<S fe ] =m d = y^Jdj((j)), 



i=i 



£[l 5fc=1 ] = d!(0), 



we have £?[m^] = from ((58]), which indicates that rh^ is unbiased. Then 



i=l 



E[(ls k =i) 2 } = d 1 {<f>), 



and 



yield 



Var[(m ) 



S[5 fc l5 fc =i] = di(0), 

^i(0) + ^Er=i4(0)A: 2 -m2 



From ( |50| ) and ( |5T| ) we find that is an unbiased estimator that achieves the Cramer-Rao lower bound 
(i.e., it is an efficient estimator). □ 

Lemma E.3: Let rh denote an unbiased estimate of the average set size m$. Then, 

WW . .,, T au_n W W 



MSE(m e ) > 



W W 



(59) 



4 = 1 j 



Proof. 
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where ^ = (1, . . . , W). Note that 



(60) 



VM V# 
V0 



E i7i 

1=1 
w 

E 



i=l 



77(1 — q h 



I -Ok 
T)(l - q k ) 



77(1- 
A; — nid 



b>H>) 



7^(1 - g fc )' 



Substituting eq. (61 ) in eq. (60), we have 



(61) 



MSE(me) * ££ 



VK W 

EE 



2 ^ EE 



^(i-^)(i-g*) 



w w 



+^EE 



j 1 



w w 



^(i-^'XW) 



i=i j 



4 = 1 J 



□ 



Similarly to what we did for eq. (22), we split eq. (59) into three pieces to analyze its behavior. 



w w 



MSE(m e ) > 



w w 



YY , , iJi , +m 2 eYY 



tr 2 



w VP 



2 ^ EE 



i[(^)- r 



C/3 
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A. Analysis of U\ 



w / \ 2k ( k /i \ 



1=1 jr' = l V ' V ' i=l J = l fc=l 

VK / x 2fc / k 



k=l yl 7 \i=l 
W / x 2fc / / \ -fc 



1 - 



^ k=l 
W 

i=i 
w 

Note that U\ is bounded by the second moment of the distribution 6*. 

5. Analysis of U 2 

2 

Note that U 2 = ^JA 2 (i). Therefore, we conclude that U 2 diverges if either 6w decreases exponentially 

i 

in W and p < a /(a + 1) or 0^ decreases slower than exponentially in W and p < 1/2. 
C. Analysis of U 3 

W W T/ -WV/ilN-n W / \ 2fc fc /,\ fc 



•J 



V v— 



r?(m e + -6>i). 
p 



Thus, 



f/ 3 = 2m e r](m g + -6^ 
p 



It is interesting to note that, counterintuitively, U 2 goes to infinity for certain values of p and while 
C/i and U 3 are always finite, even though the factor [{J^)~ l }ji that appears inside the double summation 
in U 2 is the same factor that appears multiplied by j and ji in U\ and U3, respectively. 
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D. Proof of Theorem 4.2 



Note that U\, U 2 and U3 are positive quantities and, moreover, MSE(m e ) > => U\ + U 2 > U 3 . We 

observe that U\ diverges if the second moment of 6 is infinite, U 2 diverges if Y^f=i (^) @j ~ * 00 as 
W — > 00, while U s is always finite. 
Proof. 1) When 6 W decreases faster than exponentially in W. 

In this case, the second moment of 6 is finite and the sum Y^f=i J @j = ^(1) for < p < 1. 

Therefore, MSE(m(S)) = 0(1) for < p < 1. 

2) When 6 W decreases exponentially in W. 

The second moment of 6 is still finite. However, we can show that the sum Y^f=\ (j, j ®j ^ s ^(^0 f° r 
p < a/(a + 1) and 0(1) for p > a/(a + 1) by using an argument similar to the one used in Section E of 
Appendix A. Hence, MSE(m(S)) = Vt(W) for p < a/(a + l) and MSE(m(S)) = 0(1) for p > a/(a+l). 

3) When 6 W decreases more slowly than exponentially in W. 

We can show that the sum 

Eji (jY ' 8j is tt{W) for p < 1/2 and 0(1) for p > 1/2 by using an 
argument similar to the one used in Section E of Appendix A. However, the second moment of shows 
up in U\ and it can be either finite or infinite. Although it does not affect the bound for p < 1/2, in which 
case we have logMSE(m(§)) = Vt{W), it does change the bound for p > 1/2. In particular, if p = 1/2 
and J2Y=ii 2d i = ^W' tnen M SE(m(§)) = w(l). On the other hand, if p = 1/2 and EjIiJ 2 ^ > 
then MSE(m(S)) = Finally, if p > 1/2, then MSE(m(§)) = as well. ^ 

□ 



Appendix F 
Useful identities 




